add CPU platforms to instances #8728
Conversation
RFD 505 proposes that instances should be able to set a "minimum hardware platform" or "minimum CPU platform" that allows users to constrain an instance to run on sleds that have a specific set of CPU features available. This allows a user to opt a VM into advanced hardware features (e.g. AVX-512 support) by constraining it to run only on sleds that support those features. For this to work, Nexus needs to understand what CPUs are present in which sleds. Have sled-agent query CPUID to get CPU vendor and family information and report this to Nexus as part of the sled hardware manifest.
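To illustrate the sled-agent side of that, a classification along these lines would do it. This is a sketch using the `raw_cpuid` crate (which this PR's comments mention adopting); the variant names and the family checks are simplifications, not the PR's actual code:

```rust
use raw_cpuid::CpuId;

/// Illustrative stand-in for the real enum reported to Nexus.
#[derive(Debug)]
enum SledCpuFamily {
    AmdMilan,
    AmdTurin,
    Unknown,
}

fn classify_host_cpu() -> SledCpuFamily {
    let cpuid = CpuId::new();
    let vendor = cpuid.get_vendor_info();
    let vendor = vendor.as_ref().map(|v| v.as_str());
    // family_id() folds the extended family ID in, so Zen 3 reports 0x19.
    let family = cpuid.get_feature_info().map(|f| f.family_id());
    match (vendor, family) {
        // Real classification would also consult model ranges: family 0x19
        // spans Zen 3 and Zen 4 parts, for example.
        (Some("AuthenticAMD"), Some(0x19)) => SledCpuFamily::AmdMilan,
        (Some("AuthenticAMD"), Some(0x1A)) => SledCpuFamily::AmdTurin,
        _ => SledCpuFamily::Unknown,
    }
}
```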
force-pushed from fb1e0a7 to 620d7c9
the existing plumbing was sufficient for sled-agent to report the CPU family at startup, but did not provide the CPU family when Nexus calls later for inventory collections. when you've upgraded to this version, the database migration sets the sled CPU family to `unknown`, expecting that the next inventory collection will figure things out. this doesn't happen, and the initial check-in doesn't update the CPU type either (presumably because the sled is already known and initialized from the control plane's perspective?). this does... most of the plumbing to report a sled's CPU family for inventory collection, but it doesn't actually work.

`SledCpuFamily` being in both `omicron-common` and `nexus-client` is kind of unworkable. probably need a `ConvertInto` or something to transform the shared type into the `nexus-client` type when needed? i've been trying to figure out what exactly is necessary and what is just building a mess for myself for two hours, and this feels like it's going nowhere.
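(One low-tech way out of the duplicated-type problem is a plain `From` impl from the shared type to the generated client type, living in the `nexus-client` crate so the orphan rule is satisfied. A sketch — the module paths and variant set here are guesses, not the PR's code:

```rust
use omicron_common::api::internal::shared::SledCpuFamily as SharedCpuFamily;

impl From<SharedCpuFamily> for crate::types::SledCpuFamily {
    fn from(value: SharedCpuFamily) -> Self {
        // An exhaustive match, so adding a variant to the shared enum
        // forces this conversion to be updated too.
        match value {
            SharedCpuFamily::Unknown => Self::Unknown,
            SharedCpuFamily::AmdMilan => Self::AmdMilan,
            SharedCpuFamily::AmdTurin => Self::AmdTurin,
        }
    }
}
```
)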
finally got to taking a fine-tooth comb through the CPUID bits here and the differences between hardware and what guests currently see. for the most part this is in line with what guests already get from bhyve defaults, but i've noticed a few typos that unsurprisingly do not pose an issue booting at least Alpine guests. i'll clean that up and update RFD 314 appropriately tomorrow.
nexus/src/app/instance_platform.rs
Outdated
// See [RFD 314](https://314.rfd.oxide.computer/) section 6 for all the
// gnarly details.
const MILAN_CPUID: [CpuidEntry; 32] = [
    cpuid_leaf!(0x0, 0x0000000D, 0x68747541, 0x444D4163, 0x69746E65),
a guest currently sees `eax=0x10` here, where leaves `0xe`, `0xf`, and `0x10` are all zeroes. `0xe` is reserved and zero on the host. `0xf` and `0x10` are zeroed because of the default bhyve masking behavior. setting `eax=0xd` is just more precise: leaves `0xf` and `0x10` being "present" but zero does not communicate any feature presence.
i know we reference RFD 314 but it kinda seems like it would be nice to have the comment explaining this here?
i went a bit hard on the comment and now there's like 600 more lines of more-a-code-than-a-comment (https://github.com/oxidecomputer/omicron/pull/8728/files#diff-0c754f739cd972f46b539d2a2e5a6220cd0b72e8c9fe7f20a2013fab3f28aa21R167)
nexus/src/app/instance_platform.rs
Outdated
    0xB, 0x0, 0x00000001, 0x00000002, 0x00000100, 0x00000000
),
cpuid_subleaf!(
    0xB, 0x1, 0x00000000, 0x00000000, 0x00000201, 0x00000000
this is the same as RFD 314 says, but `eax` ought to be different in the RFD. Propolis fixes up `B.1.EBX` here to the number of vCPUs, but `eax` of 0 implies that the subleaf is actually invalid (from the APM: "If this function is executed with an unimplemented level (passed in ECX), the instruction returns all zeroes in the EAX register." .. also taken faithfully, this implies the VM's topology is vCPU-many sockets with SMT pairs across each socket pair. oops).
we should probably update the RFD to reflect that? also lol Oxide is the ONLY platform that lets you make a 64-socket VM...
well, it turns out the RFD is fine here, there's a later sentence that says "Index 1's `eax` and `ebx` depend on the VM shape". But Propolis is wrong here: oxidecomputer/propolis#939
So I'm going to hide this leaf in the proposed Milan v1 platform, fix the Propolis bug, and we can expose leaf Bh in a v2.
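(To make the topology reading concrete, here's a decode of the two quoted subleaves per the documented leaf-0xB register layout — an illustrative helper, not PR code:

```rust
/// Decode one leaf-0xB subleaf: EAX[4:0] is the x2APIC-ID shift to the
/// next topology level, EBX[15:0] the logical processor count at this
/// level, ECX[15:8] the level type (0 = invalid, 1 = SMT, 2 = core).
fn decode_leaf_b(eax: u32, ebx: u32, ecx: u32) -> (u32, u32, u32) {
    (eax & 0x1F, ebx & 0xFFFF, (ecx >> 8) & 0xFF)
}

fn main() {
    // Subleaf 0 as quoted: shift=1, count=2, type=1 (SMT) -> SMT pairs.
    assert_eq!(decode_leaf_b(0x0000_0001, 0x0000_0002, 0x0000_0100), (1, 2, 1));
    // Subleaf 1 as quoted: the core level type (2) is present, but EAX=0
    // is what the APM reserves for an unimplemented level -- the bug
    // discussed above.
    assert_eq!(decode_leaf_b(0x0000_0000, 0x0000_0000, 0x0000_0201), (0, 0, 2));
}
```
)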
RFD 505 proposes that instances should be able to set a "minimum hardware platform" or "minimum CPU platform" that allows users to constrain an instance to run on sleds that have a specific set of CPU features available. Previously, actually-available CPU information was plumbed from sleds to Nexus. This actually adds a `min_cpu_platform` setting for instance creation and uses it to drive selection of guest CPUID leaves.

As-is, this moves VMs on Gimlets away from the bhyve-default CPUID leaves (which are effectively "host CPUID information, but features that are not or cannot be virtualized are masked out"), instead using the specific CPUID information set out in RFD 505. There is no provision for Turin yet, which instead gets CPUID leaves that look like Milan. Adding a set of CPUID information to advertise for an `amd_turin` CPU platform, from here, is fairly straightforward.

This does not have a mechanism to enforce specific CPU platform use or disuse, either in a silo or rack-wide. One could imagine a simple system oriented around "this silo is permitted to specify these minimum CPU platforms", but that leaves uncomfortable issues like: if silo A permits only Milan, silo B permits Milan and Turin, all Milan CPUs are allocated already, and someone is attempting to create a new Milan-based VM in silo A, should this succeed using Turin CPUs, potentially starving silo B?
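For concreteness, the user-facing surface is roughly this — a trimmed sketch of the create-time parameter, where the serialization details are assumptions based on the `amd_turin` naming above, not an excerpt from the PR:

```rust
use serde::Deserialize;

/// CPU platforms a user can request at instance create time.
#[derive(Deserialize)]
#[serde(rename_all = "snake_case")]
enum InstanceCpuPlatform {
    AmdMilan, // "amd_milan"
    AmdTurin, // "amd_turin"
}

/// Trimmed to the field this PR adds.
#[derive(Deserialize)]
struct InstanceCreate {
    /// `None`: sled selection ignores CPU constraints, and the control
    /// plane picks a "lowest common denominator" platform at start time.
    cpu_platform: Option<InstanceCpuPlatform>,
}
```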
…platform as defined in Nexus
force-pushed from 620d7c9 to 5cf7b9c
@@ -612,6 +613,7 @@ mod tests {
    external_ips: vec![],
    disks: vec![],
    boot_disk: None,
    cpu_platform: None,
sure is weird that this file doesn't include a change to `SledReservationConstraintBuilder` like in `instance_start.rs`, huh?
that's a bug. this currently would allow a Turin to migrate onto a Milan (bad)
I've done a cursory initial review. Beyond the things you've already identified, this change seems like it's going in the right direction.
/// hypervisor support may be masked off. See RFD 314 for specific CPU features
/// in a CPU platform.
Is RFD 314 public, and do we intend to make it public? If we're referencing it in a comment that will eventually make it into the public API docs, perhaps we should?
 * If this is NULL, the control plane ignores CPU constraints when selecting
 * a sled for this instance. Then, once it has selected a sled, it supplies
 * a "lowest common denominator" CPU platform that is compatible with that
 * sled to maximize the number of sleds the VM can migrate to.
turbo nit, sorry: it might be nice to have a similar comment in the db model `Instance` type --- when reading the type definition, i wasn't sure why this was null, so had to go check dbinit for it
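(Something like this on the model type would cover it — a sketch of a possible doc comment, not the PR's diff; the enum is a placeholder:

```rust
/// Placeholder standing in for the real db enum.
pub enum CpuPlatform {
    AmdMilan,
    AmdTurin,
}

pub struct Instance {
    /// `None` means the user supplied no CPU platform at create time; a
    /// concrete platform is then chosen per-VMM when the instance starts,
    /// based on the sled that ends up hosting it (see the column comment
    /// in dbinit.sql).
    pub cpu_platform: Option<CpuPlatform>,
}
```
)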
KnownVersion::new(175, "add-instance-cpu-platform"),
KnownVersion::new(174, "sled-cpu-family"),
take it or leave it, but i think it would've been fine to do this in a single migration --- though maybe separating them is nicer since they change different tables...
this is actually a weird github thing.. as a draft this was on top of another PR that added sled CPU family reporting, which is merged, and i think was migration 174! i thought this would evaporate out of the diff. i'll have to take a closer look at what's going on with that..
ah ok, so when landing the sled CPU family PR i had to bump the schema number a bit so as to actually make it the latest, and i did not rebase this PR on the commit that finally went in. i've merged `main` and this whole diff is much more reasonable now.
// VMMs get the "sled default" CPU platform when an instance starts
// up on a sled that hasn't reported a well-known CPU family. Assume
// that nothing is known about the VM's compatible CPU platforms in
// this case.
Self::SledDefault => None,
this...feels like some combination of reasonable and sketchy. i kinda wonder if we ought to just never schedule instances on sleds that haven't reported a well-known CPU family? i want to not have a situation where a VM experiences a weird blip in virtual platform due to getting scheduled on a Mystery Sled..
yeah.. this is kind of a provision to make test environments not require Exactly A Zen 3 Or Zen 5 Or Carry Your Own Patches - to get here you have to have been scheduled on a `SledCpuFamily::Unknown`, and for that to have happened either we've shipped a `sled-agent` that doesn't know how to identify a sled's CPU anymore, or we've shipped a sled that `sled-agent` doesn't know about and placed instances on it before updating it. (even if you set the instance CPU platform to `null` through the API you end up at the minimum compatible platform for the selected sled, so there's no way to get here from just API actions)
the good/bad news is that in this case you won't be able to migrate the instance anyway, so it's a "blip" in that the instance will just be this way until you stop and start it again..
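(The mapping under discussion has roughly this shape — the type, variant, and method names here are illustrative, not the PR's code; `SledDefault` is the only way to end up with `None`:

```rust
/// Illustrative stand-in for the VMM-side platform type.
enum VmmCpuPlatform {
    AmdMilan,
    AmdTurin,
    SledDefault,
}

#[derive(Debug, PartialEq)]
enum SledCpuFamily {
    AmdMilan,
    AmdTurin,
}

impl VmmCpuPlatform {
    /// Sled CPU families that can host a VMM using this platform, or
    /// `None` if nothing is known (the Mystery Sled case above).
    fn compatible_sled_families(&self) -> Option<&'static [SledCpuFamily]> {
        match self {
            // Turin hosts can present the Milan profile (per RFD 314), so
            // a Milan-platform VMM can land on either family...
            Self::AmdMilan => {
                Some(&[SledCpuFamily::AmdMilan, SledCpuFamily::AmdTurin])
            }
            // ...but not vice versa: a Turin-platform VMM must not
            // migrate onto a Milan sled.
            Self::AmdTurin => Some(&[SledCpuFamily::AmdTurin]),
            // VMMs get the "sled default" platform when scheduled on a
            // sled that reported an unknown family; assume nothing.
            Self::SledDefault => None,
        }
    }
}
```
)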
", | ||
); | ||
|
||
// TODO(gjc): eww. the correct way to do this is to write this as |
`// TODO(gjc):` 😢
Greg would be glad to know that the way to spell this that Diesel and Cockroach are happy about is fully baffling: 75ed627
// If the instance didn't supply a CPU platform, select one for this VMM by
// looking up the chosen sled and selecting the "minimum compatible
// platform" for sleds of that lineage. This maximizes the number of sleds
// that can host the VMM if it needs to migrate in the future. Selecting the
// sled first and then deriving the platform is meant to support
// heterogeneous deployments: if a deployment contains some sleds with CPUs
// from vendor A, and some with CPUs from vendor B, then selecting the sled
// first implicitly chooses a vendor, and then the "minimum compatible"
// computation selects the most compatible platform that can run on sleds
// with CPUs from that vendor.
I initially felt a bit uncomfy about leaving it unbound until the instance is started (mainly just because I was trying to think my way out of having the db column be nullable), but I think that not picking something until the instance is started is the right move, so that it can be based on the sled that ends up hosting the instance, and we can select that sled in a less constrain-y way.
/// The CPU platform to be used for this instance. If this is `null`, the
/// instance requires no particular CPU platform.
Since this comment generates API docs, perhaps we should discuss what happens if you set it to `null` here? I would want to know that "the instance requires no particular CPU platform" also means "...and will be implicitly bound to the CPU platform of whatever the first sled to host it is", rather than meaning "we will always let you schedule that instance anywhere and if the guest is confused by this it's your fault"
"we will always let you schedule that instance anywhere and if the guest is confused by this it's your fault"
that is actually what it means lol, so yes definitely this deserves a note. the platform that's bound is on the VMM, specifically so that we can't migrate you to a different kind of system once your instance is running. but if you stop it and start it again without an explicit CPU platform you can go anywhere.
sled-hardware/types/src/lib.rs
Outdated
impl std::fmt::Display for SledCpuFamily {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "{}", self.as_str())
turbo nit: slightly simpler to just do

```
- write!(f, "{}", self.as_str())
+ f.write_str(self.as_str())
```
* leaf 1 should *not* have x2APIC (yet)
* leaf 7 EDX bit 4 indicates FSRM, should be set
also rustfmt, resolve warnings, and fill in explicit zero leaves so that the RFD 314-defined CPUID table matches what the builder produces now. yay!!
frankly i did not realize how nice being able to write a `raw_cpuid::CpuId` would be, and how much more legible that makes this PR (and RFD 314, and future CPU platforms, and now we can pretty trivially make Propolis validate CPUID bits it's told, and ...)
}

#[test]
fn milan_rfd314_is_as_described() {
the other cool thing we can do now is grab a CPUID dump from a guest before using explicit definitions and include a test describing how it and the RFD 314 definitions differ. i want to include that alongside this test before merging (would have been so much nicer than the manual comparison i did before...)
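(Something like the following would capture that — a sketch with hypothetical names; the comparison helper and the stand-in tables are not from the PR:

```rust
use std::collections::BTreeMap;

/// (leaf, subleaf) -> [eax, ebx, ecx, edx]
type CpuidTable = BTreeMap<(u32, u32), [u32; 4]>;

/// Entries present in either table whose register values differ.
fn differing_entries(a: &CpuidTable, b: &CpuidTable) -> Vec<(u32, u32)> {
    let keys: std::collections::BTreeSet<_> =
        a.keys().chain(b.keys()).copied().collect();
    keys.into_iter().filter(|k| a.get(k) != b.get(k)).collect()
}

#[test]
fn legacy_dump_differs_only_where_documented() {
    // Stand-in tables; the real test would load a captured guest dump
    // and the RFD 314 builder output.
    let legacy = CpuidTable::from([(
        (0x0, 0x0),
        [0x10, 0x68747541, 0x444D4163, 0x69746E65],
    )]);
    let rfd314 = CpuidTable::from([(
        (0x0, 0x0),
        [0x0D, 0x68747541, 0x444D4163, 0x69746E65],
    )]);
    // Only the documented difference (leaf 0 EAX: 0x10 -> 0xD) shows up.
    assert_eq!(differing_entries(&legacy, &rfd314), vec![(0x0, 0x0)]);
}
```
)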
short history here: this bit is set in hardware, the extended space is not supported by bhyve, but the bit was passed through. hiding this bit "should" be fine, but out of caution we're not doing that in the first statically-defined CPU platform. this bit was not set in the `MILAN_CPUID` blob, though, even though the RFD was updated.
these tests are:
* the `milan_rfd314` function computes something that matches RFD 314
* the RFD 314 definition differs in specific and justified ways from pre-314 guest CPUID profiles.
For a reason I can't figure out, when this was

```
AND sled.cpu_family IN (...)
```

binding a parameter made that query end up like

```
AND sled.cpu_family IN ($1,)
```

which then expects a single `sled_cpu_family` enum for later binding. Binding an array then yielded a type error like

```
DatabaseError(Unknown, "invalid cast: sled_cpu_family[] -> sled_cpu_family")
```

... but writing it like `AND sled.cpu_family = ANY (...)` does not get the extra comma and instead renders like `ANY ($1)`, which happily takes a `sled_cpu_family[]`.
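For the curious, the shape of the fix looks roughly like this — a sketch, not the actual commit: the table is simplified and the column is typed as `Text` here, where the real code binds the `sled_cpu_family` enum:

```rust
use diesel::dsl::sql;
use diesel::prelude::*;
use diesel::sql_types::{Array, Bool, Text};

diesel::table! {
    sled (id) {
        id -> Int8,
        cpu_family -> Text,
    }
}

fn sleds_with_compatible_cpus(
    conn: &mut PgConnection,
    families: Vec<String>,
) -> QueryResult<Vec<i64>> {
    sled::table
        // `= ANY ($1)` takes the whole array as one bind parameter;
        // `IN ($1,)` would demand a single scalar value instead.
        .filter(
            sql::<Bool>("sled.cpu_family = ANY (")
                .bind::<Array<Text>, _>(families)
                .sql(")"),
        )
        .select(sled::id)
        .load(conn)
}
```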
this materializes RFD 314 and, in some respects, 505.
builds on #8725 for CPU family information, which is a stand-in for the notion of sled families and generations described in RFD 314. There are a few important details here where CPU platforms differ from the sled CPU family and I've differed from 314/505 (and need to update the RFDs to match). I'd not noticed the sheer volume of comments on https://github.com/oxidecomputer/rfd/pull/515 so I'm taking a pass through those, and the exact bits in `MILAN_CPUID` may be further tweaked. I suspect the fixed array needs at least a few more tweaks anyway; cross-referencing RFD 314 turns out to make for awkward review and it's hard to eyeball the semantics of bits here (or which are to be filled in by some later component of the stack!)

As-is: I think this would be OK to merge but it's not quite as polished as I'd like it to be, so it's a real PR but I expect further changes.
hardware CPU families are less linear than Oxide CPU platforms.
We can (and do, in RFD 314) define Milan restrictively enough that we can present Turin (and probably later!) CPUs to guests "as if" they were Milan. Similarly I'd expect that Turin would be defined as roughly "Milan-plus-some-AVX-512-features" and pretty forward-compatible. Importantly these are related to but not directly representative of real CPUs; as an example I'd expect "Turin"-the-instance-CPU-platform to be able to run on a Turin Dense CPU. Conversely, there's probably not a reason to define a "Turin Dense" CPU platform since from a guest perspective they'd look about the same.
But at the same time, the lineage through the AMD server part family kind of splits at Zen 4, with Zen 4- vs Zen 4c-based parts, and similarly with Zen 5/5c. It's somewhat hard (I think) to predict what workloads would be sensitive to this. And as #8730 gets into a bit, the details of a processor's packaging (core topology, frequency, cache size) can vary substantially even inside one CPU family. The important detail here is that we do not expect CPU platforms to cover these details, and it would probably be cumbersome to try; if the instance's constraint is "I want AVX256, and I want to be on high-frequency-capable processors only", then it doesn't actually matter if it's run on a Turin or a Milan, and tying it to one CPU platform may be overly restrictive.
On instance CPU platforms, the hope is that by focusing on CPU features we're able to present a more linear path as the microarchitectures grow.
instance platforms aren't "minimum"
I've walked back the initial description of an instance's CPU platform as the "minimum CPU platform". As presented in other systems, "minimum CPU platform" would more analogously mean "can we put you on a Rome Gimlet or must we put you on a Milan Gimlet?", or "Genoa Cosmo vs Turin Cosmo?" - it doesn't seem possible to say "this instance must have AVX-512, but otherwise I don't care what kind of hardware it runs on", but that's more what we mean by CPU platform.
In a "minimum CPU platform" interpretation, we could provide a bunch of Turin CPUID bits to a VM that said it wanted Milan. But since there's no upper bound here, if an OS has an issue with a future "Zen 14" or whatever, a user would discover that by their "minimum-Milan" instance getting scheduled on the new space-age processor and exploding on boot or something. OSes shouldn't do that, but...
Implementation-wise, this is really just about the names right now. You always get Milan CPUID leaves for the time being. When there are Turin CPUID leaves defined for the instance CPU platform, and Cosmos on which they make sense, this becomes more concrete.
"are these CPU platforms compatible?"
RFD 314 deserves a section outlining how we determine a new CPU can be a stand-in for an older platform, talking about this and other leaves (particularly if we pass topology information through, elsewhere). I've added a few words about the problem of deciding whether a CPUID profile is reasonable to present on hardware with a different CPUID profile, but I've also outlined what I've spotted as potential incompatibilities in the currently-dead `functionally_same()`. This is one of those things that will have the shape of "very boring and straightforward for eight years, and then a bit will change (or be defined) that's kind of annoying".
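To sketch the shape of that check — this is an illustration of the idea, not the PR's `functionally_same()` — treat registers as feature masks and require that everything the platform advertises is present on the host, while remembering that some registers are values, not masks, and will need special-casing (exactly the "a bit will change" annoyance):

```rust
use std::collections::BTreeMap;

/// (leaf, subleaf) -> [eax, ebx, ecx, edx]
type CpuidTable = BTreeMap<(u32, u32), [u32; 4]>;

/// Can `host` present `platform`'s profile to guests? True when every bit
/// the platform advertises is also set on the host. NOTE: this is only
/// sound for registers that really are feature masks; value-like registers
/// (max leaf, topology counts, cache sizes) need dedicated comparisons.
fn can_stand_in(platform: &CpuidTable, host: &CpuidTable) -> bool {
    platform.iter().all(|(key, plat)| match host.get(key) {
        // Every advertised bit must be present on the host; the host may
        // support more than the platform exposes.
        Some(host) => plat.iter().zip(host).all(|(p, h)| p & h == *p),
        // A leaf the host doesn't implement must be all-zero in the
        // platform definition.
        None => plat.iter().all(|r| *r == 0),
    })
}
```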